2023-09-12
These slides will be used in Class 05 and Class 06.
dm431 data, including blood pressuresbroom package will help us neaten model resultsnaniar package helps us identify missing valuestheme_light this time instead of theme_bw#| message: false in the code chunk to silence messages about conflicts between R packages.#| message: false?Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’: chisq.test, fisher.test
── Attaching core tidyverse packages ────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ lubridate 1.9.2 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.0
✔ readr 2.1.4
── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ tidyr::pack() masks Matrix::pack()
✖ tidyr::unpack() masks Matrix::unpack()
ℹ Use the conflicted package to force all conflicts to become errors
read_csv instead of read.csv here.
show_col_types = FALSE and you’ll get a message (see next slide.)#| message: false in the code chunk.show_col_types = FALSE
Rows: 431 Columns: 16
── Column specification ───────────────────────────────────────────────────────
Delimiter: ","
chr (6): CLASS5_ID, INSURANCE, TOBACCO, RACE_ETHNICITY, SEX, COUNTY
dbl (10): AGE, N_INCOME, HT, WT, SBP, DBP, A1C, LDL, STATIN, EYE_EXAM
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431 × 16
CLASS5_ID AGE INSURANCE N_INCOME HT WT SBP DBP A1C LDL
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 S-001 57 Medicare 22139 1.71 91.2 120 79 6.2 148
2 S-002 63 Medicaid 39268 1.52 90.6 112 74 5.9 116
3 S-003 44 Commercial 56837 1.6 89.0 118 74 8 134
4 S-004 56 Uninsured 39962 1.7 88.9 140 80 14.3 42
5 S-005 38 Medicaid 40228 1.67 116. 156 118 7.8 96
6 S-006 56 Commercial 43782 1.6 100. 128 83 6 66
7 S-007 50 Medicaid 39574 1.69 80.9 136 60 6.3 110
8 S-008 49 Medicaid 38676 1.71 106. 120 77 11.9 129
9 S-009 47 Commercial 71494 1.67 74.2 121 82 10.4 101
10 S-010 38 Medicaid 11690 1.49 81.8 131 85 5.6 128
# ℹ 421 more rows
# ℹ 6 more variables: TOBACCO <chr>, STATIN <dbl>, EYE_EXAM <dbl>,
# RACE_ETHNICITY <chr>, SEX <chr>, COUNTY <chr>
dm431?Simulated data to match old Better Health Partnership specs.
AGE, SBP and DBP.
CLASS5_ID = identification code.dm431 variable names [1] "CLASS5_ID" "AGE" "INSURANCE" "N_INCOME"
[5] "HT" "WT" "SBP" "DBP"
[9] "A1C" "LDL" "TOBACCO" "STATIN"
[13] "EYE_EXAM" "RACE_ETHNICITY" "SEX" "COUNTY"
Variables we’ll use now: AGE, SBP and DBP, mostly.
Details on other variables to come later.
# A tibble: 3 × 16
CLASS5_ID AGE INSURANCE N_INCOME HT WT SBP DBP A1C LDL TOBACCO
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 S-001 57 Medicare 22139 1.71 91.2 120 79 6.2 148 Former
2 S-002 63 Medicaid 39268 1.52 90.6 112 74 5.9 116 Never
3 S-003 44 Commerci… 56837 1.6 89.0 118 74 8 134 Never
# ℹ 5 more variables: STATIN <dbl>, EYE_EXAM <dbl>, RACE_ETHNICITY <chr>,
# SEX <chr>, COUNTY <chr>
# A tibble: 2 × 16
CLASS5_ID AGE INSURANCE N_INCOME HT WT SBP DBP A1C LDL TOBACCO
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 S-430 59 Uninsured 60335 1.58 99.5 120 82 8.8 106 Never
2 S-431 51 Commerci… 23980 1.62 88.3 121 73 6.6 96 Former
# ℹ 5 more variables: STATIN <dbl>, EYE_EXAM <dbl>, RACE_ETHNICITY <chr>,
# SEX <chr>, COUNTY <chr>
dm431 glimpse (first few values)Rows: 431
Columns: 16
$ CLASS5_ID <chr> "S-001", "S-002", "S-003", "S-004", "S-005", "S-006", "…
$ AGE <dbl> 57, 63, 44, 56, 38, 56, 50, 49, 47, 38, 64, 56, 38, 47,…
$ INSURANCE <chr> "Medicare", "Medicaid", "Commercial", "Uninsured", "Med…
$ N_INCOME <dbl> 22139, 39268, 56837, 39962, 40228, 43782, 39574, 38676,…
$ HT <dbl> 1.71, 1.52, 1.60, 1.70, 1.67, 1.60, 1.69, 1.71, 1.67, 1…
$ WT <dbl> 91.22, 90.63, 88.96, 88.91, 115.76, 100.33, 80.88, 105.…
$ SBP <dbl> 120, 112, 118, 140, 156, 128, 136, 120, 121, 131, 125, …
$ DBP <dbl> 79, 74, 74, 80, 118, 83, 60, 77, 82, 85, 64, 73, 89, 71…
$ A1C <dbl> 6.2, 5.9, 8.0, 14.3, 7.8, 6.0, 6.3, 11.9, 10.4, 5.6, 6.…
$ LDL <dbl> 148, 116, 134, 42, 96, 66, 110, 129, 101, 128, 73, 114,…
$ TOBACCO <chr> "Former", "Never", "Never", "Former", "Current", "Forme…
$ STATIN <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0…
$ EYE_EXAM <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1…
$ RACE_ETHNICITY <chr> "Non-Hispanic Black", "Hispanic or Latinx", "Non-Hispan…
$ SEX <chr> "Female", "Female", "Female", "Female", "Female", "Fema…
$ COUNTY <chr> "Cuyahoga", "Cuyahoga", "Cuyahoga", "Cuyahoga", "Cuyaho…
insurance we might wind up analyzing from characters to factors.class5_id subject codes as characters.across(where()) syntax tells R to change everything that gives a TRUE response to “is this a character variable?” into a factor variable.class5_id to be a character so we don’t accidentally analyze it.clean_names() comes from the janitor package. —>clean_names() do?case parameter specifies preferences (default is snake)
clean_names(case = "snake") yields snake_caselowerCamelUpperCamelALL_CAPSdm431 data, version 2# A tibble: 431 × 16
class5_id age insurance n_income ht wt sbp dbp a1c ldl
<chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 S-001 57 Medicare 22139 1.71 91.2 120 79 6.2 148
2 S-002 63 Medicaid 39268 1.52 90.6 112 74 5.9 116
3 S-003 44 Commercial 56837 1.6 89.0 118 74 8 134
4 S-004 56 Uninsured 39962 1.7 88.9 140 80 14.3 42
5 S-005 38 Medicaid 40228 1.67 116. 156 118 7.8 96
6 S-006 56 Commercial 43782 1.6 100. 128 83 6 66
7 S-007 50 Medicaid 39574 1.69 80.9 136 60 6.3 110
8 S-008 49 Medicaid 38676 1.71 106. 120 77 11.9 129
9 S-009 47 Commercial 71494 1.67 74.2 121 82 10.4 101
10 S-010 38 Medicaid 11690 1.49 81.8 131 85 5.6 128
# ℹ 421 more rows
# ℹ 6 more variables: tobacco <fct>, statin <dbl>, eye_exam <dbl>,
# race_ethnicity <fct>, sex <fct>, county <fct>
dm431 codebook (part 1)| Variable | Description |
|---|---|
class5_id |
subject code (S-001 through S-431) |
age |
subject’s age, in years |
insurance |
primary insurance, 4 levels |
sbp |
most recent systolic blood pressure (mm Hg) |
dbp |
most recent diastolic blood pressure (mm Hg) |
n_income |
neighborhood median income, in $ |
dm431 codebook (part 2)| Variable | Description |
|---|---|
ht |
height, in meters (2 decimal places) |
wt |
weight, in kilograms (2 decimal places) |
a1c |
most recent Hemoglobin A1c (%, with one decimal) |
ldl |
most recent LDL cholesterol level (mg/dl) |
tobacco |
most recent tobacco status, 3 levels |
statin |
1 = prescribed a statin in past 12m, else 0 |
dm431 codebook (part 3)| Variable | Description |
|---|---|
eye_exam |
1 = diabetic eye exam in past 12m, else 0 |
race_ethnicity |
race/ethnicity category, 3 levels |
sex |
all subjects turn out to be Female |
county |
all subjects live in Cuyahoga County |
dm431 variable structuretibble [431 × 16] (S3: tbl_df/tbl/data.frame)
$ class5_id : chr [1:431] "S-001" "S-002" "S-003" "S-004" ...
$ age : num [1:431] 57 63 44 56 38 56 50 49 47 38 ...
$ insurance : Factor w/ 4 levels "Medicare","Medicaid",..: 1 2 3 4 2 3 2 2 3 2 ...
$ n_income : num [1:431] 22139 39268 56837 39962 40228 ...
$ ht : num [1:431] 1.71 1.52 1.6 1.7 1.67 1.6 1.69 1.71 1.67 1.49 ...
$ wt : num [1:431] 91.2 90.6 89 88.9 115.8 ...
$ sbp : num [1:431] 120 112 118 140 156 128 136 120 121 131 ...
$ dbp : num [1:431] 79 74 74 80 118 83 60 77 82 85 ...
$ a1c : num [1:431] 6.2 5.9 8 14.3 7.8 6 6.3 11.9 10.4 5.6 ...
$ ldl : num [1:431] 148 116 134 42 96 66 110 129 101 128 ...
$ tobacco : Factor w/ 3 levels "Former","Never",..: 1 2 2 1 3 1 2 2 1 2 ...
$ statin : num [1:431] 1 0 1 1 1 1 1 1 1 1 ...
$ eye_exam : num [1:431] 0 1 0 0 1 1 1 1 0 0 ...
$ race_ethnicity: Factor w/ 3 levels "Non-Hispanic Black",..: 1 2 1 1 1 3 1 3 1 1 ...
$ sex : Factor w/ 1 level "Female": 1 1 1 1 1 1 1 1 1 1 ...
$ county : Factor w/ 1 level "Cuyahoga": 1 1 1 1 1 1 1 1 1 1 ...
# A tibble: 1 × 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 431 100
Can also use other functions from the naniar package to understand and cope with missing values:
miss_var_summary() and miss_var_table()gg_miss_var() result —>dm431 tibbleSystolic blood pressure, the top number, measures the force the heart exerts on the walls of the arteries each time it beats. Diastolic blood pressure, the bottom number, measures the force the heart exerts on the walls of the arteries in between beats. (Mayo Clinic)
Question: What is the nature of the relationship between SBP and DBP in the dm431 data?
Call:
lm(formula = sbp ~ dbp, data = dm431)
Coefficients:
(Intercept) dbp
79.849 0.664
ggplot(data = dm431, aes(x = dbp, y = sbp)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, formula = y ~ x,
col = "blue") +
geom_smooth(method = "lm", se = FALSE, formula = y ~ x,
col = "red") +
labs(title = "Predicting Systolic BP using Diastolic BP",
subtitle = "with linear and loess fits",
caption = "431 women with diabetes from dm431",
y = "sbp = Systolic BP",
x = "dbp = Diastolic BP")sbp from dbp? age
Min. :30.0
1st Qu.:48.0
Median :54.0
Mean :52.9
3rd Qu.:59.0
Max. :64.0
min Q1 median Q3 max mean sd n missing
30 48 54 59 64 52.90023 7.993414 431 0
min, Q1, median, Q3 and maxsd) of the agesCan you envision the distribution of these ages?
dm431 Agesdm431 Agesage values?
The decimal point is at the |
30 | 0000
32 | 000000
34 | 000
36 | 000000
38 | 0000000000000
40 | 0000000000000
42 | 00000000000000000
44 | 000000000000000000000
46 | 0000000000000000000000
48 | 0000000000000000000000
50 | 00000000000000000000000000000000000000
52 | 00000000000000000000000000000000
54 | 000000000000000000000000000000000000000
56 | 0000000000000000000000000000000000000000000000000
58 | 000000000000000000000000000000000000000000000000000
60 | 00000000000000000000000000000000000
62 | 00000000000000000000000000000000000000000
64 | 0000000000000000000
$age
[1] 30 30 30 31 32 32 33 33 33 33 34 34 34 36 36 36 36 36 37 38 38 38 38 38 38
[26] 38 38 38 39 39 39 39 40 40 40 40 40 41 41 41 41 41 41 41 41 42 42 42 42 42
[51] 42 42 42 43 43 43 43 43 43 43 43 43 44 44 44 44 44 44 44 44 44 45 45 45 45
[76] 45 45 45 45 45 45 45 45 46 46 46 46 46 46 46 46 46 47 47 47 47 47 47 47 47
[101] 47 47 47 47 47 48 48 48 48 48 48 48 48 48 48 48 49 49 49 49 49 49 49 49 49
[126] 49 49 50 50 50 50 50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 51
[151] 51 51 51 51 51 51 51 51 51 51 51 51 51 51 51 52 52 52 52 52 52 52 52 52 52
[176] 52 52 52 52 52 52 53 53 53 53 53 53 53 53 53 53 53 53 53 53 53 53 54 54 54
[201] 54 54 54 54 54 54 54 54 54 54 54 54 54 54 54 54 55 55 55 55 55 55 55 55 55
[226] 55 55 55 55 55 55 55 55 55 55 55 56 56 56 56 56 56 56 56 56 56 56 56 56 56
[251] 56 56 56 56 56 56 56 56 56 56 56 57 57 57 57 57 57 57 57 57 57 57 57 57 57
[276] 57 57 57 57 57 57 57 57 57 57 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
[301] 58 58 58 58 58 58 58 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
[326] 59 59 59 59 59 59 59 59 59 59 59 60 60 60 60 60 60 60 60 60 60 60 60 60 60
[351] 60 60 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 62 62 62 62
[376] 62 62 62 62 62 62 62 62 62 62 62 62 62 63 63 63 63 63 63 63 63 63 63 63 63
[401] 63 63 63 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64 64 64 64 64 64 64
[426] 64 64 64 64 64 64
psych::describe() vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 431 52.9 7.99 54 53.6 7.41 30 64 34 -0.72 -0.16 0.39
trimmed = mean of middle 80% of datamad = median absolute deviation (measures spread)se = standard error of the mean \(= {sd} / {\sqrt{n}}\)skew and kurtosis not so important todayHmisc::describe()select(dm431, age)
1 Variables 431 Observations
--------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10
431 0 34 0.998 52.9 8.944 38 41
.25 .50 .75 .90 .95
48 54 59 62 63
lowest : 30 31 32 33 34, highest: 60 61 62 63 64
--------------------------------------------------------------------------------
distinct, Info, Gmd —>Hmisc::describe elementsHmisc::describe treats a numeric variable as discrete if it has 10 or fewer distinct valuesInfo is related to how “continuous” the variable is - it’s a relative measure of the available information that is reduced below 1 by ties or non-distinct valuesGmd = Gini’s mean difference measures dispersion (spread). It is the mean absolute difference between any pairs of the 431 observations. Pronounced “Ginny”.Note: Is this actually what I want?
age Medicare Medicaid Commercial Uninsured
30 0 2 0 1
31 0 0 1 0
32 1 1 0 0
33 0 4 0 0
34 0 2 0 1
36 0 4 1 0
37 0 0 1 0
38 0 8 1 0
39 0 4 0 0
40 1 2 1 1
41 1 4 3 0
42 0 4 4 0
43 1 4 3 1
44 3 2 4 0
45 1 5 4 2
46 1 5 2 1
47 3 7 3 0
48 2 5 3 1
49 1 8 1 1
50 2 9 2 0
51 7 7 9 2
52 1 6 5 4
53 2 10 4 0
54 7 4 8 0
55 5 9 4 2
56 1 12 9 3
57 6 8 10 0
58 4 11 4 3
59 8 16 4 1
60 5 8 2 1
61 6 6 6 1
62 8 5 4 0
63 10 6 6 2
64 13 3 2 1
Do these results make sense to you?
insurance min Q1 median Q3 max mean sd n missing
1 Medicare 32 52.75 58.5 62.0 64 56.59000 6.751124 100 0
2 Medicaid 30 46.00 53.0 58.0 64 51.20419 8.385776 191 0
3 Commercial 31 47.50 54.0 57.5 64 52.69369 7.212102 111 0
4 Uninsured 30 48.00 52.0 58.0 64 52.13793 8.339768 29 0
dm431 |> group_by(insurance) |>
summarize(n = n(), mean = round_half_up(mean(age), 1),
median = median(age), sd = round_half_up(sd(age),2),
skew1 = round_half_up( (mean - median) / sd, 2))# A tibble: 4 × 6
insurance n mean median sd skew1
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Medicare 100 56.6 58.5 6.75 -0.28
2 Medicaid 191 51.2 53 8.39 -0.21
3 Commercial 111 52.7 54 7.21 -0.18
4 Uninsured 29 52.1 52 8.34 0.01
dm431 (dotplot)dm431 Systolic BPp1 <- ggplot(data = dm431, aes(x = sbp)) +
geom_histogram(bins = 5, fill = "seagreen",
col = "white") +
labs(title = "Five bins")
# omitting the code for plots p2-p4 in this slide,
# use bins = 10, 15 and 20, respectively, and use
# tomato, salmon and slateblue for fill, respectively
(p1 + p2) / (p3 + p4) +
plot_annotation(
title = "431 SBP values for women with diabetes")dm431 Diastolic BPBy a Normal model, we mean that the data are assumed to be the result of selecting at random from a probability distribution called the Normal (or Gaussian) distribution, which is characterized by a bell-shaped curve.
We evaluate whether a Normal model fits sufficiently well to our data on the basis of (in order of importance):
We evaluate whether a Normal model fits sufficiently well to our data on the basis of (in order of importance):
We evaluate whether a Normal model fits sufficiently well to our data on the basis of (in order of importance):
What would a sample of 431 systolic blood pressures from a Normal distribution look like?
dm431 systolic BPs.p1 <- ggplot(data = dm431, aes(x = sbp)) +
geom_histogram(binwidth = 5, fill = "navy", col = "gold") +
scale_x_continuous(limits = c(70, 200),
breaks = c(80, 100, 120, 140, 160, 180)) +
labs(title = "431 Observed SBP values from dm431 (mean = 128.8, sd = 16.3)")
p2 <- ggplot(sim_data, aes(x = sbp)) +
geom_histogram(binwidth = 5, fill = "deeppink", col = "black") +
scale_x_continuous(limits = c(70, 200),
breaks = c(80, 100, 120, 140, 160, 180)) +
labs(title = "431 Simulated Values from Normal model with same mean and SD")
p1 / p2dm431 SBP histogram as densitySuppose we want to rescale the histogram counts so that the bar areas integrate to 1.
dm431 SBP histogram as densityNow we can draw a Normal density curve on top of the rescaled histogram of systolic blood pressures.
dm431 SBPdm431 SBPp1 <- ggplot(dm431, aes(x = "", y = sbp)) +
geom_violin(fill = "lemonchiffon") +
geom_boxplot(width = 0.3, fill = "royalblue",
outlier.size = 3,
outlier.color = "royalblue") +
lims(y = c(70, 200)) +
coord_flip() +
labs(x = "dm431 sample",
title = "Observed SBP values")
p2 <- ggplot(sim_data, aes(x = "", y = sbp)) +
geom_violin(fill = "cornsilk") +
geom_boxplot(width = 0.3, fill = "deeppink",
outlier.size = 3,
outlier.color = "deeppink") +
lims(y = c(70, 200)) +
coord_flip() +
labs(x = "Simulated SBP data",
title = "Simulated SBP from Normal distribution")
p1 / p2dm431 DBPRemember that these are draws from a Normal distribution, so this is what a sample of 431 Normally distributed data points should look like.
Tool to help assess whether the distribution of a single sample is well-modeled by the Normal.
Given a sample of size N, R calculates what the minimum value would be expected to be for a standard Normal distribution (a Normal with mean 0 and standard deviation 1.) Then it calculates what the next smallest value would be, and so forth all the way up to the maximum value.
For our simulated blood pressure data
The Normal Q-Q plot can help us identify data as well approximated by a Normal distribution, or not, because of:
Examples coming up next —>
We’ll next build useful visualizations and numeric summaries, describing…
# code for Example 1
# plotting data sampled from
# a Normal distribution
set.seed(431)
example1 <- rnorm(n = 500, mean = 100, sd = 10)
sim_study <- tibble(example1)
p1 <- ggplot(sim_study, aes(sample = example1)) +
geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: Example 1")
p2 <- ggplot(sim_study, aes(x = example1)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "dodgerblue", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(sim_study$example1),
sd = sd(sim_study$example1)),
col = "navy", lwd = 1.5) +
labs(title = "Density Function: Example 1")
p3 <- ggplot(sim_study, aes(x = example1, y = "")) +
geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") +
labs(title = "Boxplot: Example 1", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Example 1. Sample from Normal model")Is a Normal Q-Q plot showing something close to a straight line, without clear signs of skew or indications of lots of outliers (heavy-tailedness)?
Does a boxplot, violin plot and/or histogram also show a symmetric distribution, where both the number of outliers is modest, and the distance of those outliers from the mean is modest?
Do numerical measures match up with the expectations of a normal model?
# code for Example 2
# plotting data sampled from
# a left-skewed distribution
set.seed(431)
sim_study$example2 <- rbeta(n = 500, shape = 2, shape2 = 5, ncp = 100)
p1 <- ggplot(sim_study, aes(sample = example2)) +
geom_qq(col = "darkorchid1") + geom_qq_line(col = "blue") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: Example 2")
p2 <- ggplot(sim_study, aes(x = example2)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "darkorchid1", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(sim_study$example2),
sd = sd(sim_study$example2)),
col = "blue", lwd = 1.5) +
labs(title = "Density Function: Example 2")
p3 <- ggplot(sim_study, aes(x = example2, y = "")) +
geom_boxplot(fill = "darkorchid1", outlier.color = "darkorchid1") +
labs(title = "Boxplot: Example 2", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Example 2. Left-Skewed Sample")# code for Example 3
# plotting data sampled from
# a right-skewed distribution
set.seed(431)
sim_study$example3 <- exp(rnorm(n = 500, mean = 1, sd = 0.5))
p1 <- ggplot(sim_study, aes(sample = example3)) +
geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: Example 3")
p2 <- ggplot(sim_study, aes(x = example3)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "dodgerblue", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(sim_study$example3),
sd = sd(sim_study$example3)),
col = "navy", lwd = 1.5) +
labs(title = "Density Function: Example 3")
p3 <- ggplot(sim_study, aes(x = example3, y = "")) +
geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") +
labs(title = "Boxplot: Example 3", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Example 3. Right-Skewed Sample")# code for Example 4
# plotting data sampled from
# a symmetric, but discrete distribution
set.seed(431)
sim_study$example4 <- rpois(n = 500, lambda = 8)
p1 <- ggplot(sim_study, aes(sample = example4)) +
geom_qq(col = "orangered") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: Example 4")
p2 <- ggplot(sim_study, aes(x = example4)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "orangered", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(sim_study$example4),
sd = sd(sim_study$example4)),
col = "black", lwd = 1.5) +
labs(title = "Density Function: Example 4")
p3 <- ggplot(sim_study, aes(x = example4, y = "")) +
geom_boxplot(fill = "orangered", outlier.color = "orangered") +
labs(title = "Boxplot: Example 4", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Example 4. Symmetric Discrete Sample")# code for Example 5
# plotting data sampled from
# a uniform distribution
set.seed(431)
sim_study$example5 <- runif(n = 500, min = 0, max = 100)
p1 <- ggplot(sim_study, aes(sample = example5)) +
geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: Example 5")
p2 <- ggplot(sim_study, aes(x = example5)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "dodgerblue", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(sim_study$example5),
sd = sd(sim_study$example5)),
col = "navy", lwd = 1.5) +
scale_x_continuous(limits = c(0, 100)) +
labs(title = "Density Function: Example 5")
p3 <- ggplot(sim_study, aes(x = example5, y = "")) +
geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") +
labs(title = "Boxplot: Example 5", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Example 5. Uniform Sample")# code for Example 6
# plotting data sampled from
# a symmetric, but heavy-tailed (outlier-prone) distribution
set.seed(431)
sim_study$example6 <- rnorm(n = 500, mean = 50, sd = 10)
sim_study$example6[14] <- 5
sim_study$example6[15] <- 3
sim_study$example6[39] <- 93
sim_study$example6[38] <- 97
p1 <- ggplot(sim_study, aes(sample = example6)) +
geom_qq(col = "orangered") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: Example 6")
p2 <- ggplot(sim_study, aes(x = example6)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "orangered", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(sim_study$example6),
sd = sd(sim_study$example6)),
col = "black", lwd = 1.5) +
labs(title = "Density Function: Example 6")
p3 <- ggplot(sim_study, aes(x = example6, y = "")) +
geom_boxplot(fill = "orangered", outlier.color = "orangered") +
labs(title = "Boxplot: Example 6", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Example 6. Symmetric, Outlier-Prone")dm431It is usually helpful to focus on the shape, center and spread of a distribution.
Bock, Velleman and DeVeaux suggest:
If the data are skewed, report the median and IQR (or the three middle quantiles).
If the data are symmetric, report the mean and standard deviation, and possibly the median and IQR as well.
If there are clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed.
sbpShould we focus on, in light of our visualizations?
min Q1 median Q3 max mean sd n missing
88 119 128 138 191 128.7889 16.33058 431 0
select(dm431, sbp)
1 Variables 431 Observations
--------------------------------------------------------------------------------
sbp
n missing distinct Info Mean Gmd .05 .10
431 0 79 0.999 128.8 18.16 102 108
.25 .50 .75 .90 .95
119 128 138 149 157
lowest : 88 90 92 95 96, highest: 175 176 179 180 191
--------------------------------------------------------------------------------
dm431dbp values?
The decimal point is 1 digit(s) to the right of the |
4 | 68
5 | 000022223334444455566677778888888899
6 | 0000000000001111111222222222233333333344444+58
7 | 0000000000000000111111111122222222222223333+83
8 | 0000000000000000000000000001111111111122222+64
9 | 00000012224445567
10 | 0002
11 | 88
dbp values?Which are the subjects with unusual values of dbp?
dbp?Which summaries seem most useful for the dm431 dbp data?
min Q1 median Q3 max mean sd n missing
46 66 74 81 118 73.70766 10.77089 431 0
dm431$dbp
n missing distinct Info Mean Gmd .05 .10
431 0 51 0.999 73.71 12.08 56 60
.25 .50 .75 .90 .95
66 74 81 86 90
lowest : 46 48 50 52 53, highest: 96 97 100 102 118
If a Normal model fits our data well, then we should see the following graphical indications:
dm431?dm431: Neighborhood Incomedm431: Natural Logarithm of Incomedm431 <- dm431 |> mutate(log_inc = log(n_income))
p1 <- ggplot(dm431, aes(sample = log_inc)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: log(dm431 Income)")
p2 <- ggplot(dm431, aes(x = log_inc)) +
geom_histogram(aes(y = stat(density)),
bins = 20, fill = "seagreen", col = "ivory") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$log_inc),
sd = sd(dm431$log_inc)),
col = "red", lwd = 1.5) +
labs(title = "Density Function: log(dm431 Income)")
p3 <- ggplot(dm431, aes(x = log_inc, y = "")) +
geom_boxplot(fill = "seagreen", outlier.color = "seagreen") +
labs(title = "Boxplot: log(dm431 Income)", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Observed log(n_income) in dm431")dm431: Natural Logarithm of Incomedm431: Base-10 Logarithm of Incomedm431 <- dm431 |> mutate(log10_inc = log10(n_income))
p1 <- ggplot(dm431, aes(sample = log10_inc)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot: log10(n_income)")
p2 <- ggplot(dm431, aes(x = log10_inc)) +
geom_histogram(aes(y = stat(density)),
bins = 20, fill = "seagreen", col = "ivory") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$log10_inc),
sd = sd(dm431$log10_inc)),
col = "red", lwd = 1.5) +
labs(title = "Density Function: log10(n_income)")
p3 <- ggplot(dm431, aes(x = log10_inc, y = "")) +
geom_boxplot(fill = "seagreen", outlier.color = "seagreen") +
labs(title = "Boxplot: log10(n_income)", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1))) +
plot_annotation("Observed log10(n_income) in dm431")dm431: Base-10 Logarithm of Income min Q1 median Q3 max mean sd n missing
88 119 128 138 191 128.7889 16.33058 431 0
min Q1 median Q3 max mean sd n missing
81.41991 116.0579 128.777 139.1657 175.9422 128.2372 16.05684 431 0
What can we learn from these comparisons…
The least important approach (even though it is seemingly the most objective) is the calculation of various numerical summaries.
Semi-useful summaries help us understand whether they match up well with the expectations of a normal model:
\[ skew_1 = \frac{mean - median}{standard \ deviation} \]
dm431 SBP? min Q1 median Q3 max mean sd n missing
88 119 128 138 191 128.7889 16.33058 431 0
# A tibble: 1 × 1
skew1
<dbl>
1 0.0483
What does this suggest?
dm431 variables| Variable | \(\bar{x}\) = mean | median | \(s\) = SD | \(skew_1\) |
|---|---|---|---|---|
sbp |
128.8 | 128 | 16.3 | 0.05 |
dbp |
73.7 | 74 | 10.8 | -0.03 |
age |
52.9 | 54 | 8 | -0.14 |
ldl |
110.5 | 104 | 38.3 | 0.17 |
n_income |
3.2514^{4} | 2.7903^{4} | 1.7295^{4} | 0.27 |
dm431p1a <- ggplot(dm431, aes(x = sbp)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "tomato", col = "black") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$sbp),
sd = sd(dm431$sbp)),
col = "red", lwd = 1.5) +
theme(aspect.ratio = 1) +
labs(title = "Systolic BP")
p1b <- ggplot(dm431, aes(x = dbp)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "seagreen", col = "black") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$dbp),
sd = sd(dm431$dbp)),
col = "red", lwd = 1.5) +
theme(aspect.ratio = 1) +
labs(title = "Diastolic BP")
p1c <- ggplot(dm431, aes(x = age)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "royalblue", col = "black") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$age),
sd = sd(dm431$age)),
col = "red", lwd = 1.5) +
theme(aspect.ratio = 1) +
labs(title = "Age")
p1d <- ggplot(dm431, aes(x = ldl)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "chocolate", col = "black") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$ldl),
sd = sd(dm431$ldl)),
col = "red", lwd = 1.5) +
theme(aspect.ratio = 1) +
labs(title = "LDL Cholesterol")
p1e <- ggplot(dm431, aes(x = n_income)) +
geom_histogram(aes(y = stat(density)),
bins = 10, fill = "darkcyan", col = "black") +
stat_function(fun = dnorm,
args = list(mean = mean(dm431$n_income),
sd = sd(dm431$n_income)),
col = "red", lwd = 1.5) +
theme(aspect.ratio = 1) +
labs(title = "Neighborhood Income")
(p1a + p1b + p1c)/(p1d + p1e)dm431Remember that, regardless of the distribution of the data:
If the data followed a Normal distribution, perfectly, then about:
# A tibble: 3 × 3
`sbp > mean(sbp) - sd(sbp)` `sbp < mean(sbp) + sd(sbp)` n
<lgl> <lgl> <int>
1 FALSE TRUE 70
2 TRUE FALSE 55
3 TRUE TRUE 306
# A tibble: 3 × 3
`sbp > mean(sbp) - 2 * sd(sbp)` `sbp < mean(sbp) + 2 * sd(sbp)` n
<lgl> <lgl> <int>
1 FALSE TRUE 5
2 TRUE FALSE 15
3 TRUE TRUE 411
dm431| Variable | \(\bar{x}\) | \(s\) = SD | \(\bar{x} \pm s\) | \(\bar{x} \pm 2s\) | \(\bar{x} \pm 3s\) |
|---|---|---|---|---|---|
sbp |
128.8 | 16.3 | 71% | 95.4% | 99.3% |
dbp |
73.7 | 10.8 | 71% | 95.8% | 99.5% |
age |
52.9 | 8 | 65.2% | 95.8% | 100% |
ldl |
110.5 | 38.3 | 68.4% | 95.4% | 99.5% |
n_income |
32514 | 17295 | 72.2% | 95.1% | 98.4% |
dm431p1a <- ggplot(dm431, aes(sample = sbp)) +
geom_qq(col = "tomato") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Systolic BP")
p1b <- ggplot(dm431, aes(sample = dbp)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Diastolic BP")
p1c <- ggplot(dm431, aes(sample = age)) +
geom_qq(col = "royalblue") + geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(title = "Age")
p1d <- ggplot(dm431, aes(sample = ldl)) +
geom_qq(col = "chocolate") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "LDL Cholesterol")
p1e <- ggplot(dm431, aes(sample = n_income)) +
geom_qq(col = "darkcyan") + geom_qq_line(col = "red") +
theme(aspect.ratio = 1) +
labs(title = "Neighborhood Income")
(p1a + p1b + p1c)/(p1d + p1e)dm431Don’t. Graphical approaches are far better than tests…
Shapiro-Wilk normality test
data: dm431$sbp
W = 0.98636, p-value = 0.0004525
The very small p value indicates that the test finds some indications against adopting a Normal model for these data.
Exciting, huh? Alas, not actually useful.
The nortest package, which I don’t even install as part of our 431 packages, includes many other possible tests of Normality for a batch of data, including:
nortest::ad.test(dm431$sbp) Anderson-Darling testnortest::lillie.test(dm431$sbp) Lilliefors testnortest::cvm.test(dm431$sbp) Cramer-von Mises testnortest::sf.test(dm431$sbp) Shapiro-Francia testnortest::pearson.test(dm431$sbp) Pearson chi-square testThere are multiple hypothesis testing schemes and each looks for one specific violation of a Normality assumption.
Whenever you can avoid hypothesis testing and instead actually plot the data, you should plot the data.
Sometimes you can’t plot (especially with really big data) but the test should be your very last resort.
Do we have…
Big issue: why do you need to assume a Normal model?
ggplot(data = dm431, aes(x = sbp, y = insurance)) +
geom_violin(aes(fill = insurance)) +
geom_boxplot(width = 0.3, notch = TRUE, outlier.size = 3) +
scale_fill_brewer() +
guides(fill = "none", col = "none") +
labs(title = "Systolic Blood Pressure by Insurance Type",
caption = "dm431 tibble",
y = "", x = "Systolic BP (mm Hg)") insurance min Q1 median Q3 max mean sd n missing
1 Medicare 92 120.0 129.5 138.25 191 130.5700 17.89885 100 0
2 Medicaid 90 118.5 130.0 138.00 179 128.8325 16.12919 191 0
3 Commercial 97 119.0 128.0 138.00 173 128.6216 14.50834 111 0
4 Uninsured 88 110.0 122.0 134.00 165 123.0000 18.01190 29 0
mod1 <- lm(sbp ~ insurance, data = dm431)
anova(mod1) # only makes sense if comparing means makes senseAnalysis of Variance Table
Response: sbp
Df Sum Sq Mean Sq F value Pr(>F)
insurance 3 1293 430.84 1.6226 0.1834
Residuals 427 113383 265.53
eye_exam to a factorAlso, we want to create levels Yes (1) and No (0).
ggplot(data = dm431, aes(x = eye_ex, y = n_income)) +
geom_violin(aes(fill = eye_ex)) +
geom_boxplot(width = 0.3) +
scale_fill_viridis_d(begin = 0.7, option = "A") +
guides(fill = "none", col = "none") +
labs(title = "Neighborhood Income by Eye Exam Status",
caption = "dm431 tibble",
y = "Eye Exam?", x = "Neighborhood Income ($)") eye_ex min Q1 median Q3 max mean sd n missing
1 No 7534 19013.00 26521.0 42788.00 102672 32270.93 18310.77 169 0
2 Yes 8641 21785.75 28803.5 39484.75 100437 32670.58 16640.23 262 0
What if we fit a regression model here?
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 32271. 1332. 24.2 2.37e-82 30076. 34466.
2 eye_exYes 400. 1708. 0.234 8.15e- 1 -2416. 3215.
Two Sample t-test
data: n_income by eye_ex
t = -0.23396, df = 429, p-value = 0.8151
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
90 percent confidence interval:
-3215.431 2416.133
sample estimates:
mean in group No mean in group Yes
32270.93 32670.58
Does a t test make any sense given non-Normal distribution of n_income?
ggplot(data = dm431, aes(x = eye_ex, y = log10(n_income))) +
geom_violin(aes(fill = eye_ex)) +
geom_boxplot(width = 0.3) +
scale_fill_viridis_d(begin = 0.5, option = "B") +
guides(fill = "none", col = "none") +
labs(title = "log10(Income) by Eye Exam Status",
caption = "dm431 tibble",
y = "Eye Exam?", x = "Base-10 Log of Income") eye_ex min Q1 median Q3 max mean sd n
1 No 3.877026 4.279051 4.423590 4.631322 5.011452 4.442798 0.2426069 169
2 Yes 3.936564 4.338172 4.459439 4.596428 5.001894 4.462694 0.2130671 262
missing
1 0
2 0
What if we fit a regression model here?
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.44 0.0173 257. 0 4.41 4.47
2 eye_exYes 0.0199 0.0222 0.896 0.371 -0.0167 0.0565
Two Sample t-test
data: log10(n_income) by eye_ex
t = -0.89589, df = 429, p-value = 0.3708
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
90 percent confidence interval:
-0.05650466 0.01671222
sample estimates:
mean in group No mean in group Yes
4.442798 4.462694
Is using this t test on log10(income) a sensible choice?
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_2.0.3 mosaicData_0.20.3 ggformula_0.10.4 Matrix_1.6-1
[5] lattice_0.21-8 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[9] dplyr_1.1.2 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[13] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0 patchwork_1.1.3
[17] naniar_1.0.0 janitor_2.2.0 broom_1.0.5
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 psych_2.3.6 viridisLite_0.4.2 farver_2.1.1
[5] fastmap_1.1.1 tweenr_2.0.2 rpart_4.1.19 labelled_2.12.0
[9] digest_0.6.33 timechange_0.2.0 lifecycle_1.0.3 cluster_2.1.4
[13] ellipsis_0.3.2 compiler_4.3.1 rlang_1.1.1 Hmisc_5.1-0
[17] tools_4.3.1 utf8_1.2.3 yaml_2.3.7 data.table_1.14.8
[21] knitr_1.43 htmlwidgets_1.6.2 labeling_0.4.3 bit_4.0.5
[25] mnormt_2.1.1 ggstance_0.3.6 RColorBrewer_1.1-3 withr_2.5.0
[29] foreign_0.8-84 nnet_7.3-19 grid_4.3.1 polyclip_1.10-4
[33] mosaicCore_0.9.2.1 fansi_1.0.4 colorspace_2.1-0 scales_1.2.1
[37] MASS_7.3-60 ggridges_0.5.4 cli_3.6.1 rmarkdown_2.24
[41] crayon_1.5.2 generics_0.1.3 rstudioapi_0.15.0 tzdb_0.4.0
[45] mosaic_1.8.4.2 ggforce_0.4.1 splines_4.3.1 parallel_4.3.1
[49] base64enc_0.1-3 vctrs_0.6.3 jsonlite_1.8.7 hms_1.1.3
[53] bit64_4.0.5 visdat_0.6.0 htmlTable_2.4.1 Formula_1.2-5
[57] glue_1.6.2 stringi_1.7.12 gtable_0.3.4 munsell_0.5.0
[61] pillar_1.9.0 htmltools_0.5.6 R6_2.5.1 vroom_1.6.3
[65] evaluate_0.21 haven_2.5.3 backports_1.4.1 snakecase_0.11.1
[69] Rcpp_1.0.11 checkmate_2.2.0 gridExtra_2.3 nlme_3.1-162
[73] mgcv_1.8-42 xfun_0.40 pkgconfig_2.0.3
431 Classes 05 (& 06) | 2023-09-12 | https://thomaselove.github.io/431-2023/